monitor lifecycle conductor #2723
Codecov Report
... and 2 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.3 #2723 +/- ##
===================================================
+ Coverage 74.48% 74.56% +0.07%
===================================================
Files 200 200
Lines 13603 13684 +81
===================================================
+ Hits 10132 10203 +71
- Misses 3461 3471 +10
Partials 10 10
    const log = this.logger.newRequestLogger();
-   const start = new Date();
+   const start = Date.now();
    this._scanId = uuid();
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
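A minimal sketch of the suggestion above, assuming the scan ID only needs to live for the duration of one scan. `ConductorSketch`, the bucket names, and the shape of the task messages are hypothetical stand-ins, not the PR's actual code; `crypto.randomUUID()` is swapped in for the `uuid()` helper so the example has no external dependency.

```javascript
// Sketch only: a simplified stand-in for the real conductor class.
const { randomUUID } = require('crypto');

class ConductorSketch {
    // The scan ID is created once per scan and kept in a local variable
    // instead of an instance field such as this._scanId.
    processBuckets() {
        const scanId = randomUUID();
        const buckets = ['bucket-a', 'bucket-b']; // hypothetical listing result
        return this._createBucketTaskMessages(buckets, scanId);
    }

    // The ID arrives as an argument, so two overlapping scans cannot
    // overwrite each other's value the way a shared field could.
    _createBucketTaskMessages(buckets, scanId) {
        return buckets.map(bucket => ({
            bucket,
            conductorScanId: scanId,
        }));
    }
}

const conductor = new ConductorSketch();
const firstScan = conductor.processBuckets();
const secondScan = conductor.processBuckets();
// Messages within one scan share an ID; separate scans get different IDs.
console.log(firstScan[0].conductorScanId === firstScan[1].conductorScanId);
console.log(firstScan[0].conductorScanId !== secondScan[0].conductorScanId);
```

Threading the ID through as a parameter keeps each scan's state local, at the cost of a longer argument list on every function in the call chain.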
    'current/noncurrent/orphan splits in v2) increments this counter once ' +
    'per slice. Multiple conductor_scan_id label values appearing at the ' +
    'same time indicate overlapping scans.',
    labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
The conductor_scan_id label creates a new time series per scan (UUID). While the 24h removeStaleBucketProcessorScanMetrics cleanup bounds prom-client process memory, every unique scan ID still produces a distinct series in the Prometheus TSDB until it becomes stale (5 min after the last scrape). With a typical lifecycle interval of ~6 min, that is ~240 distinct label values/day/pod. This is manageable but worth documenting: if the scan interval is ever shortened (e.g. 1 min), cardinality rises proportionally. Consider adding a note in the metric help string about the expected cardinality bounds and the cleanup mechanism.
— Claude Code
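The cardinality estimate in this comment can be reproduced with a few lines of arithmetic. The sketch below assumes exactly one scan per interval with no overlapping or skipped scans; `scanIdsPerDay` is a hypothetical helper name, not something defined in the PR.

```javascript
const MINUTES_PER_DAY = 24 * 60;

// Upper bound on the distinct conductor_scan_id label values one pod
// produces per day, assuming one scan per interval.
function scanIdsPerDay(scanIntervalMinutes) {
    if (!(scanIntervalMinutes > 0)) {
        throw new RangeError('scanIntervalMinutes must be positive');
    }
    return Math.ceil(MINUTES_PER_DAY / scanIntervalMinutes);
}

console.log(scanIdsPerDay(6)); // 240, the figure quoted in the comment
console.log(scanIdsPerDay(1)); // 1440, six times higher at a 1 min interval
```

Multiplying the result by the number of pods and by the number of other label combinations on the same metric gives a rough ceiling on the series created per day.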
LGTM: clean, well-tested observability improvement for the lifecycle conductor pipeline.
    }

-   listBucketdBuckets(queue, initMarker, log, cb) {
+   listBucketdBuckets(queue, initMarker, scanId, log, cb) {
Minor inconsistency: processBuckets was converted to Date.now() but the rate calculations in listBucketdBuckets (line 623, 636) and listMongodbBuckets (line 722, 792) still use new Date(). Both work, but since these methods are already touched by this PR (new scanId parameter), it would be cleaner to align them.
— Claude Code
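To illustrate why both forms work and what aligning them might look like, here is a small sketch. `listingRatePerSecond` is a hypothetical helper, not the rate calculation in `listBucketdBuckets` or `listMongodbBuckets`, whose exact code is not shown in this thread.

```javascript
// Hypothetical helper: buckets listed per second since startMs.
// startMs and nowMs are epoch milliseconds, as returned by Date.now().
function listingRatePerSecond(bucketCount, startMs, nowMs = Date.now()) {
    const elapsedSeconds = (nowMs - startMs) / 1000;
    // Guard against division by zero when no time has elapsed.
    return elapsedSeconds > 0 ? bucketCount / elapsedSeconds : 0;
}

// Subtracting Date objects coerces them to milliseconds, so the old and
// new styles produce the same elapsed time.
const elapsedFromDates = new Date(2000) - new Date(0);
const elapsedFromNumbers = 2000 - 0;
console.log(elapsedFromDates === elapsedFromNumbers); // true

console.log(listingRatePerSecond(100, 0, 2000)); // 50
```

Using plain numbers throughout avoids allocating `Date` objects and makes the unit (milliseconds) explicit at each call site.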
Review by Claude Code
    scanStarted = true;
    this._currentScanId = scanId;
    log.addDefaultFields({ conductorScanId: scanId });
    LifecycleMetrics.onProcessBuckets(log);
Moving onProcessBuckets from scan completion to scan start changes the semantics of the LifecycleLateScan alert. Previously, the alert fired when no scan completed within the threshold. Now it fires when no scan was started — so a conductor scan that starts but gets stuck indefinitely will no longer trigger this alert.
Consider adding a companion alert on the new s3_lifecycle_latest_batch_end_time metric to detect scans that start but never complete (stuck-scan detection).
— Claude Code
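The stuck-scan condition described in this comment can be expressed as a small predicate. This is only a sketch of the logic, not the alert rule the PR ships: `isScanStuck` is a hypothetical name, `latestStartMs` stands for a hypothetical "latest scan start" timestamp, `latestEndMs` stands for the value of `s3_lifecycle_latest_batch_end_time`, and the millisecond unit is an assumption.

```javascript
// All arguments are epoch milliseconds (the unit is an assumption).
// latestStartMs: when the most recent scan started (hypothetical source).
// latestEndMs:   value of s3_lifecycle_latest_batch_end_time.
function isScanStuck(latestStartMs, latestEndMs, nowMs, thresholdMs) {
    // A scan is in progress when it started after the last recorded end.
    const scanInProgress = latestStartMs > latestEndMs;
    return scanInProgress && (nowMs - latestStartMs) > thresholdMs;
}

const HOUR_MS = 60 * 60 * 1000;

// Started 2 hours ago and never finished: flagged with a 1 hour threshold.
console.log(isScanStuck(0, -1, 2 * HOUR_MS, HOUR_MS)); // true

// Started 2 hours ago but finished since: not flagged.
console.log(isScanStuck(0, 1000, 2 * HOUR_MS, HOUR_MS)); // false
```

Pairing this check with the existing late-scan alert covers both failure modes: no scan starting at all, and a scan starting but never completing.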
LGTM
LGTM: well-structured monitoring addition. Scan ID propagation is consistent across conductor → bucket processor → task pipelines. The semantic change to ...
LGTM: well-structured observability improvement. The scan-id propagation, per-scan metrics with 24-hour cleanup, and new alerts (stuck scan, overlapping scans, old messages) are all well-designed. A few minor observations below but nothing blocking.
Well-structured PR that adds conductor scan ID tracking, new Prometheus metrics (batch duration, bucket count, per-scan message counters, message age histogram), and LifecycleStuckScan / LifecycleBucketProcessorOverlappingScans alerts.
Issue: BB-740